Internet image
One common goal of the microbiome study is to infer dependence relationship between taxa. Relative abundance data of microbiota is compositional data. The classic statistical methods (e.g., Pearson correlation and Spearman rank correlation), however, are invalid for compositional (microbiome) data and leads to spurious correlations.
The review article, “Microbiome datasets are compositional: and this is not optional” (Gloor et al. 2017)summarized three valid statistical methods that deal with compositionality in microbiome relative abundance data to achieve reliable inferred correlation, including proportionality and SparCC, Spiec Easi.
In this mini-project, I focused on illustrating proportionality analysis on the real microbiome data. Lovell and co-authors (Lovell et al. 2015) proposed the proportionality method for relative correlation, as a valid alternative to absolute correlation. In 2017, Quinn and his colleagues developed the R package, propr to implement the proportionality analysis (Thomas P. Quinn et al. 2017).
Based on experiments on simulation datasets, proportionality is a precise indicator of absolute correlation, although sensitivity is limited (Thomas P. Quinn et al. 2017).
This mini-project is aimed to:
illustrate the proportionality analysis on a real dataset, using
propr R package
validate the proportionality method on the real dataset, by checking up the sensitivity and specificity
library(tidyverse)
library(devtools)
# install_github("tpq/propr")
library(propr)
library(Hmisc) # Spearman correlation
library(corrplot) # visualize correlation matrix
library(stringr)The microbiome OTU table and metadata was retrieved from ML Repo
# raw OTU table
raw_otu <- read.csv(file = "https://knights-lab.github.io/MLRepo/datasets/turnbaugh/refseq/otutable.txt",
header=T,
sep = "")The raw OTU table contains 557 features (OTUs) in 281 samples.
Shorten OTU identifiers with keeping only genus and species names,
using stringr package. And then accumulate counts of OTUs
at the species level.
head(raw_otu$X.OTU) ## [1] "NR_147401.1_Sutterella_massiliensis_strain_Marseille-P2435_16S_ribosomal_RNA__partial_sequence"
## [2] "NR_025930.1_Ruminococcus_bromii_strain_ATCC_27255_16S_ribosomal_RNA_gene__partial_sequence"
## [3] "NR_025230.1_Megasphaera_micronuciformis_strain_AIP_412.00_16S_ribosomal_RNA_gene__partial_sequence"
## [4] "NR_024994.1_Lactobacillus_mucosae_strain_S32_16S_ribosomal_RNA_gene__complete_sequence"
## [5] "NR_024750.1_Coprococcus_catus_strain_VPI-C6-61_16S_ribosomal_RNA_gene__partial_sequence"
## [6] "NR_024684.1_Eubacterium_nitritogenes_strain_JCM_6485_16S_ribosomal_RNA_gene__partial_sequence"
# split strings
otu.id <- raw_otu$X.OTU
split_otu.id <- str_split_fixed(otu.id, "_", n=6) %>%
as.data.frame()
# join genus and species fragments
join_split_otu.id <- split_otu.id %>%
mutate(taxa = str_c(V3, V4, sep = " "))
taxa <- join_split_otu.id$taxa
# merge taxa to raw otu table
raw_otu_taxa <- cbind(raw_otu, taxa)
# accumulate counts at the species level
otu_count_splevel <- raw_otu_taxa %>%
gather(key = "sample", value = "count", -c("ID", "X.OTU", "taxa")) %>%
group_by(sample, taxa) %>%
summarise(count_sp = sum(count, na.rm = T)) %>%
arrange(desc(count_sp)) %>%
ungroup()Read in the metadata from URL.
# metadata
meta <- read.csv(file = "https://knights-lab.github.io/MLRepo/datasets/turnbaugh/task-obese-lean-all.txt",
header = TRUE,
sep = "") %>%
mutate(sample = X.SampleID, is.obese = Var) %>%
select(sample, is.obese)
head(meta)The metadata contains 142 samples. Var indicates the
independent, binary variable of interest, Lean and
Obese.
Transpose the OTU table into sample x taxa format which is suitable for proportionality analysis.
otu_count_splevel2 <- otu_count_splevel %>%
spread(key = "taxa", value = count_sp) %>%
column_to_rownames("sample")Now, the resulting OTU count table at the species level contains 517 features/ species, in 281 samples.
Identify low-abundant OTUs and index them in the OTU table.
keep_index <- apply(otu_count_splevel2, 2, function(x) sum(x >= 10) >= 5) # index of low abundant OTUs 154 unique species/taxa remains after dropping the low-abundant taxa.
Add group variables to the resulting OTU table.
otu_count_splevel2_meta <- otu_count_splevel2 %>%
rownames_to_column("sample") %>%
inner_join(meta, by= "sample") %>%
column_to_rownames("sample")
head(otu_count_splevel2_meta) # print first rows # knitr::kable(
# head(otu_count_splevel2_meta[ ,1:8], 10),
# booktabs=TRUE,
# caption = "A table of the first 6 rows of the OTU matrix at the species level"
# )Check if any missing values in the OTU table. Replace missing values with zero if any.
sum(is.na(otu_count_splevel2)) ## [1] 0
# otu_count_splevel2_rmna <- otu_count_splevel2 %>%
# replace(is.na(.), 0)
#
# sum(is.na(otu_count_splevel2_rmna))Calculate a proportionality measurement, rho using
propr function, following the procedures and codes provided
by T.P.Quinn et al. (Thomas P. Quinn et al.
2019)
# calculate the propr object
pr <- propr(counts = otu_count_splevel2, # a count matrix with subjects as rows and features as columns
metric = "rho", # the proportionality metric to calculate, 'rho', 'pho' or 'phs'
ivar = "clr", # reference OTU/ feature for alr.
symmetrize = FALSE,
p = 100) # permutation cycles (100, by default)
# select = keep_index) # filter low-abundant taxa
# pr@matrix # proportionality metrics
# subset the propr object, dropping low-abundant taxa
pr_sub <- subset(pr, select = keep_index)
# pr_sub@matrix # proportionality metrics of high-enough-abundant taxa
# the alternative way to calculate rho statistics
# rho <- perb(otu_count_splevel2,
# ivar = "clr",
# select= keep_index) # subset abundant taxa after calculating proportionality on the complete matrixSet three required arguments in the wrapper propr
function as follows,
counts a count matrix with subjects as rows and
features as columns
metric the proportionality metric to calculate,
‘rho’, ‘pho’ or ‘phs’
ivar reference OTU/ feature
For proportionality, permute false discovery rate (FDR) for a given cut-off, instead of estimating parametric P-value.
# select a good cutoff for 'rho' by permuting the FDR at various cutoffs.
# below, use [0, .05, ..., .95, 1]
pr <- updateCutoffs(pr, cutoff = seq(0, 1,.05)) ## |------------(25%)----------(50%)----------(75%)----------|
# pr@fdr # identify the maximal cutoff at FDR < 5% The table shows that the largest cutoff is 0.20 when the FDR reaches 5% or fewer.
Identify highly-proportional (rho >= 0.7) feature/ species pairs.
pr_sub_high <- pr_sub[">", .7]
pr_sub_high ## Not weighted and not alpha-transformed
## @counts summary: 281 subjects by 153 features
## @logratio summary: 281 subjects by 153 features
## @matrix summary: 153 features by 153 features
## @pairs summary: 2 feature pairs
## @fdr summary: iterations
## See ?propr for object methods
pr_sub_high@pairs # index of highly proportional pairs ## [1] 9246 17256
pr_sub_high_mt <- pr_sub_high@matrix %>%
as.data.frame()
high_propr_pairs <- pr_sub_high_mt %>%
rownames_to_column("partner") %>%
mutate(partner = factor(partner)) %>%
gather(key="pair", value = "rho", -partner) %>%
mutate(pair = factor(pair)) %>%
mutate(partner_index = as.numeric(partner)) %>%
mutate(pair_index = as.numeric(pair)) %>%
filter(partner_index < pair_index) %>%
filter(rho > 0.7) %>%
select(partner, pair, rho) %>%
mutate(rho = round(rho, 2))
knitr::kable(high_propr_pairs,
booktab = TRUE,
caption = "Highly proportional species pairs (rho > 0.7)") | partner | pair | rho |
|---|---|---|
| Blautia luti | Blautia wexlerae | 0.79 |
| Megasphaera indica | Olsenella uli | 0.89 |
Visualize proportionality with a strict cutoff, using
getNetwork function. The package vignette
describes several built-in tools for visualizing proportionality.
getNetwork(pr,
cutoff = 0.7,
d3=F) # not display in 3D in this case Proportionality network based on rho matrics
## IGRAPH 120d145 UN-- 273 22346 --
## + attr: name (v/c), color (v/c), size (v/n), label (v/l), color (e/c)
## + edges from 120d145 (vertex names):
## [1] Natranaerovirga hydrolytica--Rothia mucilaginosa
## [2] Natranaerovirga hydrolytica--Syntrophococcus sucromutans
## [3] Natranaerovirga hydrolytica--Streptococcus lactarius
## [4] Natranaerovirga hydrolytica--Parvimonas micra
## [5] Natranaerovirga hydrolytica--Lactobacillus paraplantarum
## [6] Natranaerovirga hydrolytica--Sphingomonas koreensis
## [7] Natranaerovirga hydrolytica--Gemella sanguinis
## [8] Natranaerovirga hydrolytica--Ralstonia pickettii
## + ... omitted several edges
# plot(pr_sub_high)How to interpret proportionality? Proportionality depends on log-transformation and must get interpreted with respect to the chosen reference.
Interpret clr-based proportionality to signify a coordination that follows the general trend of the data. In other words, these proportional OTUs move together as individuals relative to how most genes move on average (Thomas P. Quinn et al. 2019).
Apply Spearman rank correlation to raw absolute abundance matrix. Consider the resulting Spearman correlation coefficients (absolute correlation) as the true correlations between taxa pairs.
Use Hmisc package to implement Spearman rank correlation
analysis and corrplot package to visualize correlation
matrix.
spearman_corr <- Hmisc::rcorr(as.matrix(otu_count_splevel2),
type = "spearman")
corrplot::corrplot(spearman_corr$r,
method = "color",
tl.pos = "n", # mutate text labels
type="upper") Correlation matrix of 517 species based on Spearman rank correlation method
# extract Spearman coefficients
abs_corr_r <- spearman_corr$r %>%
as.data.frame() %>%
rownames_to_column("partner") %>%
mutate(partner = factor(partner)) %>%
gather(key="pair", value = "sp_coef", -partner) %>%
mutate(pair = factor(pair)) %>%
mutate(partner_index = as.numeric(partner)) %>%
mutate(pair_index = as.numeric(pair)) %>%
filter(partner_index < pair_index) %>%
arrange(partner_index, pair_index)
# extract Spearman p-val
abs_corr_pval <- spearman_corr$P %>%
as.data.frame() %>%
rownames_to_column("partner") %>%
mutate(partner = factor(partner)) %>%
gather(key="pair", value = "sp_pval", -partner) %>%
mutate(pair = factor(pair)) %>%
mutate(partner_index = as.numeric(partner)) %>%
mutate(pair_index = as.numeric(pair)) %>%
filter(partner_index < pair_index) %>%
arrange(partner_index, pair_index)
# combine Spearman coefficients and p-val
abs_corr_r_pval <- abs_corr_r
abs_corr_r_pval$sp_pval <- abs_corr_pval$sp_pval
# proportionality matrix (rho)
propr_rho <- pr@matrix %>%
as.data.frame() %>%
rownames_to_column("partner") %>%
mutate(partner = factor(partner)) %>%
gather(key="pair", value = "rho", -partner) %>%
mutate(pair = factor(pair)) %>%
mutate(partner_index = as.numeric(partner)) %>%
mutate(pair_index = as.numeric(pair)) %>%
filter(partner_index < pair_index) %>%
arrange(partner_index, pair_index)
# combine absolute correlation and relative correlation coefficients
abs_rel_coef <- abs_corr_r_pval
abs_rel_coef$rho_propr <- propr_rho$rho
ggplot(abs_rel_coef, aes(x= sp_coef, y= rho_propr))+
geom_jitter(alpha = 0.5) +
stat_smooth(method = "lm")# contingency table
cont_matrix <- table("Actual" = abs_rel_coef$sp_coef > 0.9,
"Observed" = abs_rel_coef$rho_propr > 0.9)
cont_matrix # print ## Observed
## Actual FALSE TRUE
## FALSE 121453 11869
## TRUE 24 40
prop.table(cont_matrix, 1)[1,1] # specificity## [1] 0.9109749
prop.table(cont_matrix, 1)[2,2] # sensitivity ## [1] 0.625
At the rho (proportionality matrices) cutoff of 0.9, the specificity of the proportionality is 91.1% and the sensitivity 62.5%.
propd functions in the propr package can
test whether proportionality of certain pairs change between binary
experimental groups.
The package vignette explains mathematically the differential proportionality measures.
According to the package vignette,
propd method compares the variance of log-ratio (VLR) for
one pair across groups, considering two types of differential
proportionality,
disjointed proportionality: proportionality of a
pair holds in both groups, but the ratio between the partners changes
between the groups (i.e., the slope of the proportionality
changes)
emergent proportionality: proportionality exits in
only one of the groups (i.e., the strength of the proportionality
changes)
propd function estimates differential proportionality by
calculating \(theta\) for all features
pairs.
The function takes the following arguments as input:
counts a matrix of \(n\) samples (as rows) and \(d\) features (as columns)
group an \(n\)-dimensional vector corresponding to
subject labels
alpha an optional argument to trigger and guide
transformation
p the total number of permutations used to estimate
FDR
counts <- otu_count_splevel2_meta %>% select(-is.obese) # count matrix, sample X taxa
group <- otu_count_splevel2_meta$is.obese # binary classes, obese vs lean
pd <- propd(counts, group, alpha = NA, p=100) # calculate disjointed- and emergent propr. According the package vignette, the propd object contains theta_d and
theta_e (among others), although only theta_d is active by default.
Users can easily change which theta is active the functions
setDisjointed and setEmergent.
theta_d <- setDisjointed(pd) # activate only disjointed diff prop
theta_e <- setEmergent(pd) # activate only emergent diff prop
## Warning in setEmergent(pd): Emergent proportionality not yet validated for
## unequal group sizes.Extract the detailed results from the above objects, including OTU pairs and differential proportionality statistics.
How to identify most important pairs based on the theta_d and theta_e? Qualitatively, the smaller the theta_d, the more important the proportional pairs are. In contrast, the bigger the theta_e, the more important the proportional pairs are.
# estimate FDR using updateCutoffs function
theta_d <- updateCutoffs(theta_d, cutoff = seq(0.91, 1, 0.001)) ## |------------(25%)----------(50%)----------(75%)----------|
theta_d@fdr knitr::kable(head(theta_d@fdr, 10),
caption = "FDRs at various rho cutoffs",
digits = 2) | cutoff | randcounts | truecounts | FDR |
|---|---|---|---|
| 0.91 | 23.50 | 0 | Inf |
| 0.91 | 24.12 | 0 | Inf |
| 0.91 | 26.41 | 1 | 26.41 |
| 0.91 | 27.29 | 1 | 27.29 |
| 0.91 | 28.19 | 1 | 28.19 |
| 0.92 | 29.58 | 1 | 29.58 |
| 0.92 | 32.68 | 1 | 32.68 |
| 0.92 | 34.90 | 2 | 17.45 |
| 0.92 | 41.02 | 2 | 20.51 |
| 0.92 | 45.57 | 2 | 22.78 |
# disjointed diff. proportionality
theta_d_df <- getResults(theta_d)
theta_d_df_sub <- theta_d_df %>%
select(Partner, Pair, theta, lrv, lrv1, lrv2, lrm, lrm1, lrm2) %>%
mutate(lrv = round(lrv,2),
lrv1 = round(lrv1, 2),
lrv2 = round(lrv2, 2),
lrm = round(lrm, 2),
lrm1 = round(lrm1, 2),
lrm2 = round(lrm2, 2))
theta_d_df_sub In the table, theta indicates theta_d in this case.
lrv1 represents log-ratio variance in the
Obese group; lrv2 in the Lean
group. lrm1 represents log-ratio mean in the
Obese group, and lrm2 in the Lean
group.
The column theta equals theta_d matrices. Values of
theta are 0.91 or greater which are relatively large,
indicating that almost all proportional pairs do not obviously change
with experimental conditions (i.e., obese or not). In addition, FDR of
all the proportional pairs are greater than 5%.
Similarly, estimate FDR for the emergent differential proportionality.
theta_e <- updateCutoffs(theta_e)# emergent diff. proportionality
theta_e_df <- getResults(theta_e) Visualize the network based on disjointed differential
proportionality using plot function.
Providing a value of [0,1] to the cutoff argument will index pairs based on a maximum value of theta. Alternately, providing an integer greater than 1 to the cutoff argument will index the top N pairs as ranked by theta.
Note that setting d3 = TRUE will have the
rgl package render the network in 3D.
The igraph package implement visualization of the theta_d object.
net_d <- plot(theta_d, # disjointed proportionality object
cutoff= 300,
d3= F)Disproportionality network based on disjointed proportionality
Each “node” represents a species/taxa while each connecting “edge” represents an indexed pair (i.e. theta < cutoff).
For disjointed proportionality networks, red edges represent
proportional pairs has higher log-ratio mean in Obese
compared to Lean; green edges represent higher log-ratio
mean in Lean compared to Obese.
Importantly, a small number of taxa/ species participants in a large number of the top differentially proportional pairs.
Visualize the network based on emergent differential proportionality
using plot function.
net_e <- plot(theta_e, # emergent proportionality object
cutoff=800,
d3=F) # whether or not display 3-D plot Disproportionality network based on emergent proportionality
For the emergent network, red edges represent an emergency of
proportionality in Obese compared to Lean
(i.e., sudden coordination in Obese) while green edges
represent a lack of proportionality in Obese compared to
Lean (i.e., no coordination in Obese).
The analysis was performed with R version 4.2.1. Version information of all the R packages are as follow,
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] corrplot_0.92 Hmisc_4.7-2 Formula_1.2-4 survival_3.4-0
## [5] lattice_0.20-45 propr_4.3.0 devtools_2.4.5 usethis_2.1.6
## [9] forcats_0.5.2 stringr_1.4.1 dplyr_1.0.10 purrr_0.3.5
## [13] readr_2.1.3 tidyr_1.2.1 tibble_3.1.8 ggplot2_3.3.6
## [17] tidyverse_1.3.2
##
## loaded via a namespace (and not attached):
## [1] googledrive_2.0.0 colorspace_2.0-3 deldir_1.0-6
## [4] ellipsis_0.3.2 htmlTable_2.4.1 base64enc_0.1-3
## [7] fs_1.5.2 rstudioapi_0.14 farver_2.1.1
## [10] remotes_2.4.2 fansi_1.0.3 lubridate_1.8.0
## [13] xml2_1.3.3 splines_4.2.1 cachem_1.0.6
## [16] knitr_1.40 pkgload_1.3.1 jsonlite_1.8.2
## [19] broom_1.0.1 cluster_2.1.4 dbplyr_2.2.1
## [22] png_0.1-7 shiny_1.7.3 compiler_4.2.1
## [25] httr_1.4.4 backports_1.4.1 assertthat_0.2.1
## [28] Matrix_1.5-3 fastmap_1.1.0 gargle_1.2.1
## [31] cli_3.4.1 later_1.3.0 htmltools_0.5.3
## [34] prettyunits_1.1.1 tools_4.2.1 igraph_1.3.5
## [37] gtable_0.3.1 glue_1.6.2 Rcpp_1.0.9
## [40] cellranger_1.1.0 jquerylib_0.1.4 vctrs_0.4.2
## [43] nlme_3.1-160 xfun_0.33 ps_1.7.2
## [46] rvest_1.0.3 mime_0.12 miniUI_0.1.1.1
## [49] lifecycle_1.0.3 googlesheets4_1.0.1 scales_1.2.1
## [52] hms_1.1.2 promises_1.2.0.1 RColorBrewer_1.1-3
## [55] yaml_2.3.5 memoise_2.0.1 gridExtra_2.3
## [58] sass_0.4.2 rpart_4.1.19 latticeExtra_0.6-30
## [61] stringi_1.7.8 highr_0.9 checkmate_2.1.0
## [64] pkgbuild_1.3.1 rlang_1.0.6 pkgconfig_2.0.3
## [67] evaluate_0.18 labeling_0.4.2 htmlwidgets_1.5.4
## [70] processx_3.8.0 tidyselect_1.2.0 magrittr_2.0.3
## [73] R6_2.5.1 generics_0.1.3 profvis_0.3.7
## [76] DBI_1.1.3 mgcv_1.8-41 pillar_1.8.1
## [79] haven_2.5.1 foreign_0.8-83 withr_2.5.0
## [82] nnet_7.3-18 modelr_0.1.10 crayon_1.5.2
## [85] interp_1.1-3 utf8_1.2.2 tzdb_0.3.0
## [88] rmarkdown_2.18 urlchecker_1.0.1 jpeg_0.1-9
## [91] grid_4.2.1 readxl_1.4.1 data.table_1.14.4
## [94] callr_3.7.3 reprex_2.0.2 digest_0.6.29
## [97] xtable_1.8-4 httpuv_1.6.6 munsell_0.5.0
## [100] bslib_0.4.1 sessioninfo_1.2.2